2/28/23
Q: Understanding the meaning behind the heat maps. How is it collinear regarding race and not age? What would collinearity look like if it were correlated with age?
A: Great question - we’re going to continue with this today!
Q: Are we allowed to use the analysis we use in class, for our analysis for the case study? As long as we understand it, can we just copy and paste?
A: Kind of. What we discuss in class will help answer Q2 about collinearity…and will get you well on your way to answering Q1 about the relationship between RTC laws and violent crime. So, yes, you can copy+paste anything from class that you want, but you’ll need to additionally make some decisions to fully answer the questions.
Q: I am confused about what happens if our group accidentally works on the same lines/parts of the file for cs01. Will our work be overwritten by those who push later? I'm not familiar with GitHub, so I may need more explanation.
A: If you work on the exact same lines, yes, this will cause a “merge conflict” …so it will not silently overwrite what someone else did…but it will force you to decide whose version you want to keep. Best solution is to not work on the same parts of the file at the same time.
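For reference, when a merge conflict does happen, Git marks the disputed region of the file like this (the branch names here are placeholders):

```
<<<<<<< HEAD
your version of the line
=======
your teammate's version of the line
>>>>>>> origin/main
```

To resolve it, edit the file to keep the version you want, delete the three marker lines, then add and commit the result.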
Last Lecture: Life Lessons That Have Nothing to Do with Data or Science
A UCSD Data Science Education will teach you a lot. There will be programming, data, dataviz, statistics, machine learning, linear algebra, ethics, capstone projects, and domain knowledge galore. But these courses will not teach you the very specific lessons that Prof Ellis has learned along her journey. Come hear the advice that took Prof Ellis decades to receive, the stories surrounding it, and the lessons she hopes you learn faster than she did, all in one jam-packed chat.
The Last Lecture Series is a huge opportunity for students to gain some insight about a Professor’s journey and the obstacles they had to overcome to get to where they are today, especially coming from professors who have reached success in the field of data science. We highly encourage you to attend!
This event will be happening at 5pm on Wednesday, 3/1.
We’ll be hosting this at the SDSC Auditorium! Registration is no longer needed.
Notes:
EDA Example #1: Shenova
EDA Example #2
p2 <- DONOHUE_DF |>
  # one row per state with the year its RTC law took effect
  # (replaces a grouped summarise of an unchanged column, which does the same thing)
  distinct(STATE, RTC_LAW_YEAR) |>
  ggplot(aes(x = RTC_LAW_YEAR)) +
  geom_bar() +
  scale_x_continuous(breaks = seq(1980, 2015, by = 1)) +
  labs(
    title = "Distribution of RTC Law Years",
    x = "RTC Law Year", y = "Count"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90),
    plot.title.position = "plot"
  )
p2
EDA Example #3: Sebastian
library(maps)
# load state map data
states_map <- map_data("state")
# merge state map data with DONOHUE_DF data
# (note: map_data() regions are lowercase state names, so STATE must match that format)
DONOHUE_DF_map <- merge(states_map, DONOHUE_DF, by.x = "region", by.y = "STATE")
# plot map using ggplot2
ggplot() +
  geom_polygon(
    data = DONOHUE_DF_map,
    aes(
      x = long, y = lat, group = group,
      fill = cut(
        Viol_crime_rate_1k,
        breaks = c(0, 3, 4, 5, 6, 7, 8, 9, 10),
        labels = c("0-3", "3-4", "4-5", "5-6", "6-7", "7-8", "8-9", "9-10")
      )
    ),
    color = "white", size = 0.1
  ) +
  coord_fixed() +
  theme_void() +
  scale_fill_brewer(
    name = "Violent Crime Rate per 1k", palette = "YlOrRd", na.value = "white",
    labels = c("0-3", "3-4", "4-5", "5-6", "6-7", "7-8", "8-9", "9-10"),
    breaks = c("0-3", "3-4", "4-5", "5-6", "6-7", "7-8", "8-9", "9-10")
  ) +
  labs(
    title = "Southern States Have the Highest Violent Crime Rates",
    subtitle = "Violent Crime Rates per 1000 in Each State",
    caption = "Note: White areas indicate missing data"
  ) +
  theme(
    plot.title = element_text(size = 15, face = "bold"),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(size = 8, hjust = 0),
    legend.position = "bottom",
    legend.title.align = 0.5,
    legend.text = element_text(size = 6),
    legend.title = element_text(size = 8)
  )
Rows: 3,921
Columns: 21
$ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ to_multiple <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cc <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 2, 1, 0, 2, 0, …
$ sent_email <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, …
$ time <dttm> 2011-12-31 22:16:41, 2011-12-31 23:03:59, 2012-01-01 08:…
$ image <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ dollar <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 0, 0, …
$ winner <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n…
$ inherit <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ password <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ num_char <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.091, 4.837, 7.421…
$ line_breaks <int> 202, 202, 192, 255, 29, 25, 193, 237, 69, 68, 25, 79, 191…
$ format <fct> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …
$ re_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, …
$ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 10, 4, 10, 20, 0…
$ number <fct> big, small, small, small, none, none, big, small, small, …
❓ Would you expect longer or shorter emails to be spam?
# A tibble: 2 × 2
spam mean_num_char
<fct> <dbl>
1 0 11.3
2 1 5.44
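The comparison above can be reproduced with dplyr (a sketch; the email data ships with the openintro package):

```r
library(dplyr)
library(openintro)  # provides the `email` data set

# average email length (in thousands of characters), split by spam status
email |>
  group_by(spam) |>
  summarise(mean_num_char = mean(num_char))
```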
❓ Would you expect emails that have subjects starting with “Re:”, “RE:”, “re:”, or “rE:” to be spam or not?
We'll use a single variable (num_char) as the predictor, but the model we describe can be expanded to take multiple predictors as well. This isn't something we can reasonably fit a linear model to; we need something different!
\[ y_i \sim \text{Bern}(p) \]
All GLMs have the following three characteristics:
1. A probability distribution describing a generative model for the outcome variable
2. A linear model: \(\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\)
3. A link function that relates the linear model to the parameter of the outcome distribution
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]
\[p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}\]
In R we fit a GLM in the same way as a linear model except we:
- use `logistic_reg()` instead of `linear_reg()`
- use `"glm"` instead of `"lm"` as the engine
- define `family = "binomial"` for the link function to be used in the model

# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1.80 0.0716 -25.1 2.04e-139
2 num_char -0.0621 0.00801 -7.75 9.50e- 15
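These estimates come from a fit along the following lines (a sketch using tidymodels, mirroring the bullet points above):

```r
library(tidymodels)
library(openintro)  # provides the `email` data set

# logistic regression of spam status on email length, via the glm engine
spam_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(spam ~ num_char, data = email, family = "binomial")

tidy(spam_fit)
```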
Model:
\[ \log\left(\frac{p}{1-p}\right) = -1.80-0.0621\times \text{num_char} \]
For an email with 2,000 characters (num_char is recorded in thousands of characters, so \(\text{num\_char} = 2\)):
\[\log\left(\frac{p}{1-p}\right) = -1.80-0.0621\times 2\]
\[\frac{p}{1-p} = \exp(-1.9242) = 0.15 \rightarrow p = 0.15 \times (1 - p)\]
\[p = 0.15 - 0.15p \rightarrow 1.15p = 0.15\]
\[p = 0.15 / 1.15 = 0.13\]
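The same calculation in R, using the built-in inverse logit plogis() as a cross-check:

```r
# model prediction on the log-odds scale for num_char = 2 (i.e. 2,000 characters)
log_odds <- -1.80 - 0.0621 * 2

# invert the logit to get the predicted probability
p <- exp(log_odds) / (1 + exp(log_odds))
round(p, 2)               # 0.13
round(plogis(log_odds), 2) # same value via the built-in inverse logit
```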
❓ What is the probability that an email with 15000 characters is spam? What about an email with 40000 characters?
❓ Would you prefer an email with 2000 characters to be labelled as spam or not? How about 40,000 characters?
|   | Email is spam | Email is not spam |
|---|---|---|
| Email labelled spam | True positive | False positive (Type 1 error) |
| Email labelled not spam | False negative (Type 2 error) | True negative |
False negative rate = P(Labelled not spam | Email is spam) = FN / (TP + FN)
False positive rate = P(Labelled spam | Email is not spam) = FP / (FP + TN)
Sensitivity = P(Labelled spam | Email is spam) = TP / (TP + FN) = 1 − false negative rate
Specificity = P(Labelled not spam | Email is not spam) = TN / (FP + TN) = 1 − false positive rate
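A quick numeric illustration of these rate definitions (the counts below are made up for illustration, not taken from the email data):

```r
# hypothetical confusion-matrix counts
TP <- 100; FN <- 267; FP <- 30; TN <- 3524

fnr  <- FN / (TP + FN)  # false negative rate
fpr  <- FP / (FP + TN)  # false positive rate
sens <- TP / (TP + FN)  # sensitivity = 1 - fnr
spec <- TN / (FP + TN)  # specificity = 1 - fpr
```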
❓ If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the trade-offs associated with each decision?
spam_mult <- logistic_reg() |>
  set_engine("glm") |>
  fit(spam ~ num_char + to_multiple + re_subj, data = email, family = "binomial")

tidy(spam_mult)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1.20 0.0752 -16.0 2.21e-57
2 num_char -0.0686 0.00781 -8.78 1.57e-18
3 to_multiple1 -2.14 0.299 -7.18 6.92e-13
4 re_subj1 -3.12 0.360 -8.66 4.70e-18
\[ \begin{aligned} \log_e \left(\frac{p}{1 - p}\right) &= - 1.20 - 0.07 \times \texttt{num_char} \\ &\quad - 2.14\times \texttt{to_multiple}_{\texttt{1}} \\ &\quad - 3.12 \times \texttt{re_subj}_{\texttt{1}} \\ \end{aligned} \]
So for an email with 4,000 characters (num_char = 4, since num_char is in thousands), addressed to a single recipient (to_multiple = 0), whose subject did start with "re:" (re_subj = 1)…
\[ \begin{aligned} \log_e \left(\frac{p}{1 - p}\right) = - 1.20 - 0.07 \times 4 - 2.14\times 0 - 3.12 \times 1 \end{aligned} \]
\[ \begin{aligned} \log_e \left(\frac{p}{1 - p}\right) = - 4.60 \end{aligned} \]
…solve for \(\widehat{p}\)
\[ \begin{aligned} \frac{e^{-4.60}}{1 + e^{-4.60}} = 0.00995 = 0.995\% \end{aligned} \]
About a 1% chance that such an email would be spam
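Plugging the estimates into R confirms the arithmetic for this email:

```r
# linear predictor (log-odds scale) for num_char = 4, to_multiple = 0, re_subj = 1
log_odds <- -1.20 - 0.07 * 4 - 2.14 * 0 - 3.12 * 1

# invert the logit to get the predicted probability of spam
p <- exp(log_odds) / (1 + exp(log_odds))
```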
Single predictor (num_char)
# A tibble: 1 × 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 2437. 3920 -1173. 2350. 2363. 2346. 3919 3921
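Fit statistics like these come from broom's glance() applied to the underlying glm object (a sketch; $fit pulls the glm out of the parsnip wrapper):

```r
library(tidymodels)
library(openintro)  # provides the `email` data set

# single-predictor model as above
spam_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(spam ~ num_char, data = email)

glance(spam_fit$fit)  # null deviance, logLik, AIC, BIC, deviance, nobs
```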
Multiple predictors
Introduction to Modern Statistics Chapter 9: Logistic Regression